Semantic Annotations in COROLA, the Corpus of Contemporary Romanian Language
Dan Cristea

COROLA
• Contemporary Romanian: 1945 to the present
• Registers: belletristic, drama, essays, poetry, journalism, scientific
• Domains: as many as possible
• Representation: perhaps proportional to reading density rather than writing density
• Heavily annotated on several levels:
  – compulsory: sub-syntactic, morpho-syntactic
  – optional: syntactic, semantic, discourse

COROLA
• Who will be the beneficiaries?
  – language researchers
  – professors and students
  – Romanian language users
  – writers
  – historians of language (an imprint of the language at a certain moment in its evolution)

Goals of our endeavour
• Propose annotation conventions for the semantic level
• Develop a lexical-semantic knowledge base from which Natural Language Processing (NLP) tools can learn to reproduce the same type of annotation on other texts
• Investigate the level of trust in these annotations

“Quo Vadis”
• Romanian translation of the novel by Nobel Prize laureate Henryk Sienkiewicz
  – translation by Remus Luca and Elena Lință, Tenzi Publishing House, 1991

Development of the “QV” Corpus

            #sent   #entities   #ref links   #AKS
Sept 2013   1,500       3,663        2,045      75
May 2014    7,150      22,310       21,439   1,133
June 2014   7,281      24,643       21,605     759

[Sept 2013] D. Cristea & E. Ignat: Linking Book Characters. Toward a Corpus Encoding Relations Between Entities, SpeD Conference, Cluj-Napoca
[May 2014] D. Cristea, D. Gîfu, M. Colhon, P. Diac, A. D. Bibiri, C. Mărănduc, L. A. Scutelnicu: Quo Vadis: A Corpus of Entities and Relations, in “Recent Advances in Language Production, Cognition and the Lexicon”, Springer (to appear)
[June 2014] A. D. Bibiri: “A Language Resource: the Quo Vadis Corpus”, master's thesis in CL, Univ. Iași (to be delivered)

Many false relations eliminated

The annotation process
Extremely complex:
  – done originally by students in CL on chunks of text
  – verified by a core group of experts
  – stitched together at chunk boundaries
  – finally re-verified by an expert
  – filters applied

Layers of initial annotation
• sentence boundaries
• tokens (words or word compounds, punctuation marks)
• part-of-speech tags
• lemmas

What do we annotate?
• Singular and group entities of type PERSON and GOD, and their body parts
  – Individual: [Ligia], [ea] (“she”), [Marcus Vinicius], [împăratul] (“the emperor”), [al lui Petronius] (“of Petronius”), [al lui] (“his”), [un grup mare de credincioși] (“a large group of believers”), [care] (“who”), [Christ]
  – Nested (imbricated): [alte femei din [societatea înaltă]] (“other women from high society”), [[his] hand], [[[his] mother’s] sister]
  – Realisation included: dar 1:[îl] și 2:[iubeau, REALISATION="INCLUDED"] din tot sufletul (“but they also loved him with all their heart”)

What do we annotate?
• Relations between entities
  – Referential:
    • coreferential: 1:[Ligia] 2:[tânăra libertă] (“the young freedwoman”)
    • part-of: 1:[mâna 2:[lui] dreaptă] (“his right hand”) ⇒ part-of
    • isa: 1:[nașă] să-mi fie 2:[Pomponia] (“let Pomponia be my godmother”) ⇒ isa
    • …

What do we annotate?
• Relations between entities
  – Affection: 1:[ 2:[lui]] ⇒ friend-of
  – Kinship: 1:[ 2:[lui Vinicius]] ⇒ parent-of
  – Social: 1:[Tânărul] luptase 2:[lui Corbulon] ⇒ inferior-of

Relations in Corpus
[Figure: chart of relations in the corpus; recoverable labels: “Affective: love”, “Social: superior-of”]
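
The entity and relation conventions listed on the "What do we annotate?" slides can be pictured as a small data model. The Python sketch below is only an illustration under stated assumptions: the class names, field names, the `PERSON-PART` label and the `validate` helper are hypothetical and are not the format actually used by the Quo Vadis corpus. The relation inventory contains only the types named on the slides, which themselves mark the list as incomplete ("…").

```python
from dataclasses import dataclass
from typing import List, Optional

# Relation types named on the slides; the real inventory is larger ("...").
RELATION_TYPES = {
    "referential": {"coreferential", "part-of", "isa"},
    "affection": {"friend-of", "love"},
    "kinship": {"parent-of"},
    "social": {"inferior-of", "superior-of"},
}


@dataclass
class Entity:
    """One bracketed span, e.g. 1:[Ligia]. Field names are hypothetical."""
    id: int
    text: str
    etype: str = "PERSON"              # the slides name PERSON and GOD
    parent: Optional[int] = None       # id of the enclosing entity when nested
    realisation: Optional[str] = None  # e.g. "INCLUDED" for a subject carried by the verb


@dataclass
class Relation:
    """A typed, directed link between two entities."""
    category: str  # "referential", "affection", "kinship" or "social"
    rtype: str
    source: int
    target: int


def validate(entities: List[Entity], relations: List[Relation]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    ids = {e.id for e in entities}
    problems = []
    for e in entities:
        if e.parent is not None and e.parent not in ids:
            problems.append(f"entity {e.id}: unknown parent {e.parent}")
    for r in relations:
        if r.rtype not in RELATION_TYPES.get(r.category, set()):
            problems.append(f"unknown relation type {r.category}/{r.rtype}")
        if r.source not in ids or r.target not in ids:
            problems.append(f"dangling link {r.source}->{r.target}")
    return problems


# Two examples taken from the slides.
entities = [
    Entity(1, "Ligia"),
    Entity(2, "tânăra libertă"),
    # "PERSON-PART" is a placeholder label for body parts, not a corpus tag.
    Entity(3, "mâna lui dreaptă", etype="PERSON-PART"),
    Entity(4, "lui", parent=3),  # nested inside entity 3
]
relations = [
    Relation("referential", "coreferential", 2, 1),
    Relation("referential", "part-of", 3, 4),
]

problems = validate(entities, relations)
bad = validate(entities, [Relation("social", "friend-of", 1, 9)])
```

A check of this kind is one conceivable form for the "filters applied" step of the annotation process, although the slides do not say what the actual filters test.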